Effective population size in field pea

Background Effective population size (Ne) is a pivotal parameter in population genetics as it can provide information on the rate of inbreeding and the contemporary status of genetic diversity in breeding populations. The population with smaller Ne can lead to faster inbreeding, with little potential for genetic gain making selections ineffective. The importance of Ne has become increasingly recognized in plant breeding, which can help breeders monitor and enhance the genetic variability or redesign their selection protocols. Here, we present the first Ne estimates based on linkage disequilibrium (LD) in the pea genome. Results We calculated and compared Ne using SNP markers from North Dakota State University (NDSU) modern breeding lines and United States Department of Agriculture (USDA) diversity panel. The extent of LD was highly variable not only between populations but also among different regions and chromosomes of the genome. Overall, NDSU had a higher and longer-range LD than the USDA that could extend up to 500 Kb, with a genome-wide average r2 of 0.57 (vs 0.34), likely due to its lower recombination rates and the selection background. The estimated Ne for the USDA was nearly three-fold higher (Ne = 174) than NDSU (Ne = 64), which can be confounded by a high degree of population structure due to the selfing nature of pea. Conclusions Our results provided insights into the genetic diversity of the germplasm studied, which can guide plant breeders to actively monitor Ne in successive cycles of breeding to sustain viability of the breeding efforts in the long term. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10587-6.


Introduction
Dry pea (Pisum sativum L.) is a diploid, cool-season legume and a member of the Leguminosae family [1].Pea is one of the most important pulse crops grown in more than 100 countries, where 7,043,605 hectares of dry pea were planted around the world with a total production of 12,403,522 tonnes [2].In the USA alone, the pea production reached one million tonnes in 2019 [3].In recent years, pea protein has become more popular in the market for plant-based diets e.g., Beyond ® Meat Burger [4].Pea seeds have earned a reputation as a dietary goldmine with around 15 -32% protein content, vitamins, folate, fibers, potassium and minerals, which is good for human health and helps prevent cardiovascular and specific cancer diseases [4,5].The increasing popularity of plantbased proteins in the market has further propelled the demand for peas.Therefore, the study of genetic diversity should expand to accelerate the genetic gain of pea varieties to meet future demands, maintaining the diversity in peas is the top priority for plant breeders [4,6].
Estimation of effective population size (N e ) determines the rate of inbreeding [7,8] and genetic changes due to genetic drift [9].N e is an important parameter in population genetics and breeding introduced by Sewall Wright in 1931, which helps breeders to maintain and monitor the level of genetic diversity in their species [10].The estimated N e is expected to be smaller than the census size (N), as it influences the rate at which genetic diversity decreases within a population [11,12].Relatively smaller N e indicates limited population diversity, which, in turn, can restrict genetic advancement within a breeding program [13].Moreover, N e parameter retrieves the population dynamics of the genes [14].
The effective size of a population refers to the hypothetical number of individuals in an idealized population that would exhibit a comparable genetic response to stochastic processes, similar to that observed in a real-world population which is based on the Wright-Fisher model [15][16][17].This model shows genetic drift as the main operating factor, and that changes in allelic and genotypic frequencies over generations are solely influenced by the population size (N) [15].In real-world breeding populations, factors such as mutation, migration, natural selection, and non-random mating come into play [15].These factors affect the actual rates of inbreeding and changes in gene frequency variance observed in a population [18].This will indeed impact N e and therefore, reduce the genetic variation and diversity.The most commonly used extensions for effective population size theory are variance effective size and inbreeding effective size [15].The variance effective size reflects the rate of change in gene frequency variance, while inbreeding effective size corresponds to the rate of inbreeding observed in a population [19].These measures allow us to quantify the consequences of genetic drift in a real population, based on the characteristics and dynamics of the idealized Wright-Fisher population [15].
While N e of a population can be estimated either from demographic data or genetic markers, the latter is preferred [20][21][22].Demographic data involves using census size and variance of reproductive success whereas genetic markers reveal changes in allele frequencies over time and are based on linkage disequilibrium (LD).When the pedigree or demographic data is not available, N e can be estimated using genetic markers [23].The most popular and widely-employed genetic approach has been the temporal method, which relies on temporal fluctuations in allele frequencies observed on multiple samples collected from the same population [14].N e , however, can also be directly estimated using LD between loci at various distances along the genome [13,24].Recent advancements in high-throughput sequencing and the availability of high-density markers such as single nucleotide polymorphisms (SNPs) have increased over the past decade, contributing to the LD-based approach now being acknowledged as more reliable, robust [25], cost and time effective than the temporal approach [9].
Linkage disequilibrium (represented as r 2 ) is a phenomenon characterized by the non-random association of alleles at various loci [26] which became popular in recent years for predicting N e [27].Correlations between alleles are generated by genetic drift when it is inversely proportional to N e [9], which changes the allele frequencies in a population over time.The biggest advantage of LD over the temporal method [28], is the strength of associations between markers that can be used to calculate N e at any time (generations) from a single population accurately without relying on longitudinal data.This makes LD a valuable tool for studying populations where temporal information may be limited or unavailable.Recombination and mutation rates are fundamental processes that shape the genetic landscape [29] and by analyzing LD, we can better understand their history and apply it to plant breeding and population genetics [30].
In this study, we estimated the extent of LD decay in the dry pea genome and utilized the relationship between LD and recombination frequency, as initially described by Sved J [24], to estimate N e which is convenient as it only requires one sampling time [31,32].We used two sets of populations: 1) NDSU modern breeding lines, hereafter referred to as NDSU set, and 2) USDA diversity panel, hereafter referred to as USDA set.Our objectives were two-fold: (i) to estimate N e for these two germplasms set in dry pea and (ii) to compare the genetic variation between these germplasms.To achieve these goals, we developed a comprehensive R package that implements the Sved J [24] formula for N e prediction.This package not only caters to the specific needs of dry pea research but can also be adapted for use in other crop species.Since there has been no information on N e for peas, our findings serve as a valuable reference for researchers seeking to determine the minimum number of lines required for designing experiments.Furthermore, comparing the genetic variation between NDSU modern breeding lines and USDA multi-environmental lines provides valuable information about the diversity and potential of these germplasm collections.This knowledge can guide breeding programs and conservation efforts, ensuring the maintenance and enhancement of genetic resources in dry pea cultivation.

Plant materials
In this study, we used plant materials from two distinct germplasms pool.The first population comes from the NDSU Pulse Breeding Program (NDSU set) where 300 advanced elite lines were generated from multiple bi-parental populations.The NDSU breeding lines represented a set of pre-selected, non-structured, elite advanced lines at the preliminary yield testing stage, which were carefully chosen and contained both contemporary and past elite germplasm [33,34].The breeding lines were built using modern and historical elite cultivars and germplasm in the breeding program, which are representative of a decade of continuous genetic improvement.Further, these selected lines were created specifically with a focus on phenotypes including high yield, grain quality, resistance to disease and some other desirable agronomic traits [33,34].
The second population is from a USDA diversity panel (USDA set), and contained 482 accessions, of which 292 samples were from the Pea Single Plant Plus Collection (Pea PSP) [4,35,36].The USDA set was composed of accessions that represent most of available diversity within the USDA pea germplasm collection based on the knowledge of geography, taxonomy, morphology and genotyping-by-sequencing data generated previously [35].

DNA extraction, sequencing and variant calling
Leaf tissues from the greenhouse were collected at different stages for all NDSU elite lines and USDA accessions.The DNA from the lyophilized tissues were extracted using the DNeasy Plant Mini Kit (Qiagen).Detailed information regarding the tissue collections and extractions are provided in Bari M [4,33].Both NDSU set and USDA set were sequenced using genotyping-by-sequencing (GBS).Using the restriction enzyme ApeKI, dual-indexed GBS libraries for both populations were prepared [37].Samples were sequenced using NovaSeq S1 × 100 Illumina sequencing technologies.The NDSU set sequenced libraries were retrieved with a quality score ≥ 30.For USDA set, FASTQC [38] was utilized to perform quality check and removed reads with lengths < 50 bases.All reads that passed the quality check were aligned with the reference genome [39] (https:// www.pulse db.org).Finally, the aligned reads were analyzed using SAMtools (v1.10) and generated the variant files (VCF) using Free-Bayes (V1.3.2).
The amount of single nucleotide polymorphisms (SNPs) identified for the NDSU set was 28,832, while 380,527 SNP markers were identified in the USDA set [4,34].For these marker datasets, we filtered minor allele frequency (MAF), since alleles with < 5% could produce bias to the LD and N e calculations [40,41].We also removed markers with more than 20% missing values using Plink v1.9 [42] and heterozygosity > 20% using Tassel v5.0 [43].The resulting marker sets consisted of 7,157 (NDSU set) and 19,826 (USDA set) SNP markers that were used for downstream analysis.

Calculation of linkage disequilibrium (r. 2 )
LD was calculated using Plink v1.9 [42] with a maximum distance of 750 kb.Using "ggplot2" R package, the genomewide and chromosome-wide LD-decay (r 2 ) were visualized against the physical distance (kb) to show the recombination history (see Figs. 1 & 2).
LD scores were also estimated using Genome-wide Complex Trait Analysis (GCTA) software for window size of 1000 kb and r 2 cutoff of 0 [44].This approach was employed to visualize the distribution of mean LD throughout the genome.

Calculation of effective population size
Effective population size (N e ) for both the NDSU set and the USDA set were estimated based on LD using the Sved J [24] equation.The recombination rate (cM) was calculated using cM/Mb conversion ratio from a recent pea genetic linkage map [45] and then transformed to Morgan's (c).
where, N e = effective population size.c = genetic distance in Morgan's The expected r 2 was predicted by linear regression model using least square estimation (LSE), Prediction of r 2 : (1) The mean r 2 from the Y parameter was calculated by LD (r 2 ) for the genetic distance 'c' using 'group by' mean function in R Environment [46].Now with the availability of all required parameters, we finally estimated N e from Eq. (1) using LSE.
According to the formula (Eq.1), we assigned the variables as predictor (X) and response (Y) and calculated the coefficient β 1 without the intercept term β 0 , following Juma R [47].
Again, we used Eq.(3) to calculate the coefficient β 1 which represents N e .

Linkage disequilibrium decay rate and scores
The decay of linkage disequilibrium (r 2 ) was examined in both NDSU set and USDA set by utilizing 7,157 and 19,826 SNP markers, respectively.This analysis allowed for the identification of the physical distance at which the decay rate occurred.Supplementary Fig. 1 depicts the distribution of SNPs within and across chromosomes for both populations, providing an illustration of the marker density.The NDSU set's genome-wide LDdecay plot (Fig. 1) demonstrates that the r 2 reached its peak value of 0.57 within the initial kilobases and subsequently exhibited a gradual decline.The r 2 showed a decrease from 0.3 to 0.25 when the genomic distance increased from 150 to 250 kb.Following that, the LD within each chromosome was observed visually in Fig. 2 in order to improve comprehension of the decay pattern.Chromosomes 1 and 6 exhibited a rapid decay at approximately 175 kb, while chromosomes 2 and 5 demonstrated a comparatively slower decay rate of around 350 kb.Furthermore, it is worth noting that chromosome 5 had the higher r 2 value of 0.61 compared to other chromosomes.Whereas, the genomewide LD of USDA set showed that r 2 started at a lower value of 0.34 and dropped rapidly and reached 0.2 and 0.1 at 100 kb and 200 kb (Fig. 1).From the chromosome-wide LD-decay (Fig. 2), we observed that chromosome 3 dropped faster around ~ 150 kb, but the r 2 decreased below 0.1 for chromosomes 4 and 7. Also, chromosomes 1, 5 and 6 decayed slowly (~ 250 kb) and reached r 2 0.1.We also observed that chromosome 1 exhibited a higher r 2 of 0.37.LD-decay figures show the trend of the r 2 decaying from LD to linkage equilibrium (LE).Additionally, we performed calculations of LD scores as an alternative metric for inferring LD.The analysis of local LD in the NDSU set indicates a notable rise in the average r 2 of 0.6 across all chromosomes.The average r 2 of chromosomes 5 and 6 was the highest with 0.8.The genomic interval encompassing the centromeric region of chromosome 2 was missing.In contrast, the USDA set exhibited low average r 2 , with chromosome 2 hardly reaching 0.4, and chromosomes 1, 4, and 7 having few sets that reached 0.3.It is worth noting that the LD density of the NDSU set is comparatively lower than the USDA set (Fig. 3).
With respect to recombination rate (centimorgans-cM), the genome-wide r 2 on average decayed from 0.54 to 0.27 at 0.7 cM for the NDSU set, indicating a moderate level of correlation within this specific genetic distance across the genome.In contrast, the USDA set had lower average r 2 (0.28) which dropped within a shorter genetic distance (0.5 cM).This implies that as the distance between the markers increases to 0.5 cM, they tend to be less correlated with each other (Supplementary Fig. 2).
The level of LD exhibited significant variation across distinct genomic regions and populations of dry peas.The impracticality of conducting whole-genome scanning can be attributed to the excessive number of markers required for such studies, particularly in cases where there is a low level of linkage disequilibrium [48].The USDA set reported a low LD value, indicating a higher occurrence of recombination events.In contrast, the NDSU set showed a higher LD score, suggesting a greater frequency of linked markers presumably due to limited recent recombination to date [49].

Effective population size (N e )
Based on LD, the estimated effective population size (N e ) for both the populations are shown in Fig. 4. The smaller N e and high LD in NDSU set indicates that it has undergone selective pressures leading to reduced diversity and increased correlation between the markers.Given NDSU set's population history and marker density, it is acceptable to state that despite lower N e , it holds a reasonable level of diversity that may help maintain its genetic variability which is essential for long-term viability and adaptability.The USDA set resulted in lower LD and higher N e , meaning it has more diversity and has encountered relatively fewer instances of selective pressures or genetic bottlenecks.It is important to note that the low LD can also be observed in a population with high N e .Thus, it was expected to see NDSU set with lower N e compared to USDA set.These estimates explain how genetic drift and selections have shaped these populations over time.

Discussion
The importance of N e has become increasingly recognized in plant breeding as it describes the rate of inbreeding and can reflect the contemporary status of genetic diversity in breeding populations [50].When N e is low, the population can become quickly inbred with little Fig. 3 The Mean LD scores estimated in 1000 kb windows.There is a significant increase in LD of NDSU set compared to USDA set potential for genetic gain making long-term selection ineffective.Therefore, plant breeders should be cognizant of the effective population size of their breeding program [10].Actively monitoring N e in successive cycles of breeding can enhance the viability of the breeding efforts and help sustain long-term genetic gain.In this study, we presented the first estimation of N e in dry pea using two distinct germplasm sets: 1) the NDSU set consisting of elite breeding lines within the NDSU breeding program, and 2) the USDA set comprised of landraces and plant introductions collected all over the world [35,36].The former represents breeding lines and germplasm in an active breeding program that releases new modern cultivars, while the latter represents germplasm accessions in a repository.As expected, the estimated N e for the USDA set (Ne = 174) was higher than the NDSU set (Ne = 64).The selection and derivation of closely related breeding lines from multiple breeding populations likely resulted to a lower N e estimation in the NDSU set, presumably due to increased inbreeding.The genetic diversity for the USDA set is higher than the NDSU set as it represents most of the available diversity in the USDA pea germplasm collection [35,36].
The N e estimate for the NDSU set was within the same range as those reported in other self-pollinating crops such as rice (Oryza sativa) and soybean (Glycine max), with calculated N e ranging from 20 to 60. Juma R [47] estimated the N e in rice to be 22 using an elite core panel comprised of 72 lines, but N e may have been underestimated due to limited marker information used in the analysis.Similar studies in rice also had the same range of N e , with calculated values ranging from 23-57 and 40-60; these were estimated based on breeding populations from recurrent selection programs [51] and pedigree data [52].The estimated N e of USDA set was within the range of N e values reported in studies conducted on other crops.In soybean, Xavier A [53] estimated N e for the USDA soybean germplasm collection comprised of 19,652 accessions from Bandillo N [54] and reported it to be 106 individuals.Recent studies have shown that soybean possess several genetic bottlenecks [55] and its genetic diversity has been reduced [56,57].The N e estimate of USDA set is relatively higher than soybean, implying greater diversity.Zhao Y [58] estimated N e in wild rice using 11 Chinese Oryza rufipogon populations including 32 landraces and reported it between 96-158, which is in a similar range to the USDA set.Thus, the N e of USDA set offers greater potential for adaptation, maintaining rare alleles, population stability, and reduced risk for inbreeding.
The results of our study also suggest that the use of GBS holds good potential for making inferences of N e regardless of the germplasm type.Using GBS-based markers, we approximated the LD pattern within and across chromosomes of both germplasms and then used the LD information for estimation of N e .Genomewide LD (r 2 ) of the USDA set decayed from lower LD at 200 kb, while the NDSU set had the highest LD declined at a longer distance of around 250 kb.These results provided consistency of higher genetic variations of the former over the latter.Similar LD findings have been observed in previous studies conducted on peas, wherein both wild and spring peas exhibited a decay distance of approximately 200 kb, whereas wild/ landrace peas were around 100 kb [49] which is a bit lower than the USDA set.Comparing the LD of USDA set and the NDSU set to other selfing crops such as rice, soybeans, and barley, the physical distances found were more or less similar depending on the populations.For instance, Huang X [59] estimated LD using O. indica and O. japonica landraces of rice at 123 and 167 kb, respectively, with r 2 declining to 0.25 and 0.28.Additionally, soybean landraces extended from 90 to 500 kb [60] while improved cultivars hit 133 kb [61] which is similar to the USDA set.Alternatively, a recent LD analysis from soybean USDA germplasm revealed that the r 2 dropped intragenically within a few kilobases [61] and the one in barley's landraces hit 90 kb [62], both shorter than the USDA set.The LD-decay of the NDSU set was also found to be in a similar range with elite varieties of barley which extended to at least 212 kb [62] and O. japonica elite lines at ~ 318 kb [63], but had a higher distance compared to O. indica elite lines (~ 124 kb) [63].The LD-decay rate of a crop does depend on the genetic background of the populations being studied, and it can be affected due to mutations, genetic drift, non-random mating, and a small N e [64].
Effective population size helps breeders preserve and remodel their selection strategies to enhance the stability and variability in their breeding populations [10].Breeders can also implement marker-based mating experiments known as optimum contribution selection (OCS) [47] in order to maintain diversity in selection candidates for long-term gain.As pulse crop breeders navigate through challenges in their breeding programs, the information from this study provides valuable insights by demonstrating the strength of contemporary populations and possibly contributing to the long-term goal of increasing genetic gain while maintaining diversity in breeding programs.

Conclusions
We provided insights of effective population size (N e ) in field pea which can guide plant breeders to actively monitor N e in successive cycles of breeding to sustain viability of the breeding efforts in the long term.Our estimations revealed that the N e of USDA set (174) was larger than the NDSU set (64), providing insights into the extent of inbreeding and available genetic diversity in both germplasm pool.For future estimation of N e , researchers could incorporate additional biological information (e.g., gene expression, metabolomics, etc.) along with DNA markers and demographic history, that will likely increase the understanding of plant breeders regarding the population dynamics and potential for adaptation to different everchanging environments.

Fig. 1
Fig. 1 Genome-wide linkage disequilibrium-decay of NDSU set and USDA set

Fig. 2
Fig. 2 Chromosome-wide linkage disequilibrium-decay of NDSU set and USDA set

Fig. 4
Fig. 4 Estimated effective population size (N e ) for NDSU set is 64 and USDA set is 174